Embedded software: task scheduling and compilation: Register aware scheduling for
distributed cache clustered architecture 
Zhong Wang, Xiaobo Sharon Hu, Edwin H.-M. Sha 

January 2003 Proceedings of the 2003 conference on Asia South Pacific design 
automation ASPDAC 

Publisher: ACM Press 

Full text available: pdf(117.33 KB) Additional Information: full citation , abstract , references

Increasing wire delays have become a serious problem for sophisticated VLSI designs. 
Clustered architecture offers a promising alternative to alleviate the problem. In the 
clustered architecture, the cache, register file and function units are all partitioned into 
clusters such that short CPU cycle time can be achieved. A key challenge is the 
arrangement of inter-cluster communication. In this paper, we present a novel algorithm 
for scheduling inter-cluster communication operations. Our algorith ...

Publisher: IEEE Computer Society

Additional Information: full citation , abstract , references , citings , index
Full text available: pdf(1.44 MB)



Software Pipelining is a loop scheduling technique that extracts parallelism from loops by 
overlapping the execution of several consecutive iterations. There has been a significant 
effort to produce throughput-optimal schedules under resource constraints, and more 
recently to produce throughput-optimal schedules with minimum register requirements.
Unfortunately even a throughput-optimal schedule with minimum register requirements is 
useless if it requires more registers than those available in t ... 

Dependences and register allocation: Prematerialization: reducing register pressure 
for free 

Ivan D. Baev, Richard E. Hank, David H. Gross 
September 2006 Proceedings of the 15th international conference on Parallel 
architectures and compilation techniques PACT '06 

Publisher: ACM Press 

Full text available: pdf(159.54 KB) Additional Information: full citation , abstract , references , index terms

Modern compiler transformations that eliminate redundant computations or reorder 
instructions, such as partial redundancy elimination and instruction scheduling, are very 
effective in improving application performance but tend to create longer and potentially 
more complex live ranges. Typically the task of dealing with the increased register 
pressure is left to the register allocator. To avoid introduction of spill code which can
reduce or completely eliminate the benefit of earlier optimization ...

Parallelization of loops with exits on pipelined architectures
P. Tirumalai, M. Lee, M. Schlansker

November 1990 Proceedings of the 1990 ACM/IEEE conference on Supercomputing 
Supercomputing '90 

Publisher: IEEE Computer Society 

Full text available: pdf(1.20 MB) Additional Information: full citation , abstract , references , citings 

Modulo scheduling theory can be applied successfully to overlap Fortran DO loops on 
pipelined computers issuing multiple operations per cycle both with and without special 
loop architectural support [1, 2, 3]. This paper shows that a broader class of loops -
repeat-until, while, and loops with more than one exit - where the trip count is not known
beforehand, can also be overlapped efficiently on multiple issue pipelined machines.
Special features that are required in the architecture, as ...

Keywords: Conditional execution, dependence graphs, loop scheduling, modulo 
scheduling, performance bounds, pipelined architectures, software pipelining, while loops 



5 Code scheduling: Phi-Predication for light-weight if-conversion
Weihaw Chuang, Brad Calder, Jeanne Ferrante 

March 2003 Proceedings of the international symposium on Code generation and
optimization: feedback-directed and runtime optimization CGO '03

Publisher: IEEE Computer Society

Full text available: pdf(1.19 MB) Additional Information: full citation , abstract , references , citings , index terms

Predicated execution can eliminate hard to predict branches and help to enable instruction
level parallelism. Many current predication variants exist where the result update is 
conditional based upon the outcome of the guarding predicate. However, conditional 
writing of a register creates a naming problem for an out-of-order processor, and can stall 
the issuing of instructions. This problem arises from potential multiple predicated 
definitions reaching a use, which is unresolved until the prior ...

6 Allocating architected registers through differential encoding 
Xiaotong Zhuang, Santosh Pande 

April 2007 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 29 Issue 2

Publisher: ACM Press

Full text available: pdf(312.02 KB) Additional Information: full citation , abstract , references , index terms

Micro-architecture designers are very cautious about expanding the number of architected 
and exposed registers in the instruction set because increasing the register field adds to 
the code size, raises the I-cache and memory pressure, and may complicate the 
processor pipeline. Especially for low-end processors, encoding space could be extremely 
limited due to area and power considerations. On the other hand, the number of 
architected registers exposed to the compiler could directly affect the ... 

Keywords: Register allocation, architected register, differential encoding

Differential register allocation
Xiaotong Zhuang, Santosh Pande 

June 2005 ACM SIGPLAN Notices , Proceedings of the 2005 ACM SIGPLAN conference
on Programming language design and implementation PLDI '05, Volume 40 
Issue 6 

Publisher: ACM Press

Full text available: pdf(214.10 KB) Additional Information: full citation , abstract , references , index terms

Micro-architecture designers are very cautious about expanding the number of architected 
registers (also the register field), because increasing the register field adds to the code 
size, raises I-cache and memory pressure, complicates processor pipeline. Especially for 
low-end processors, encoding space could be extremely limited due to area and power 
considerations. On the other hand, the number of architected registers exposed to the 
compiler could directly affect the effectiveness of compiler ... 

Keywords: architected register, differential encoding, register allocation



Processor microarchitecture II: Predicate prediction for efficient out-of-order
execution

Weihaw Chuang, Brad Calder

June 2003 Proceedings of the 17th annual international conference on
Supercomputing ICS '03

Publisher: ACM Press

Full text available: pdf(220.52 KB) Additional Information: full citation , abstract , references , index terms

Predicated execution is an important optimization even for an out-of-order processor, 
since it can eliminate hard to predict branches and help to enable software pipelining.
Using predication with out-of-order execution creates a naming bottleneck, because there
can be multiple definitions reaching a use, and not knowing which use is the correct one
can stall the processor. In this paper, we examine using predicate prediction to
speculatively allow execution to proceed in the face of multiple def ... 

Keywords: predicate prediction, predicated execution 



Demystifying on-the-fly spill code

Alex Aleta, Josep M. Codina, Antonio Gonzalez, David Kaeli 

June 2005 ACM SIGPLAN Notices , Proceedings of the 2005 ACM SIGPLAN conference 
on Programming language design and implementation PLDI '05, Volume 40 
Issue 6 

Publisher: ACM Press 

Full text available: pdf(192.08 KB) Additional Information: full citation , abstract , references , index terms

Modulo scheduling is an effective code generation technique that exploits the parallelism 
in program loops by overlapping iterations. One drawback of this optimization is that 
register requirements increase significantly because values across different loop iterations 
can be live concurrently. One possible solution to reduce register pressure is to insert spill
code to release registers. Spill code stores values to memory between the producer and 
consumer instructions. Spilling heuristics can be ... 

Keywords: modulo scheduling, register allocation, spill code 



10 A performance analysis of PIM, stream processing, and tiled processing on
memory-intensive signal processing kernels
Jinwoo Suh, Eun-Gyu Kim, Stephen P. Crago, Lakshmi Srinivasan, Matthew C. French
May 2003 ACM SIGARCH Computer Architecture News , Proceedings of the 30th
annual international symposium on Computer architecture ISCA '03, Volume
31 Issue 2

Publisher: ACM Press

Full text available: pdf(239.50 KB) Additional Information: full citation , abstract , references , citings

Trends in microprocessors of increasing die size and clock speed and decreasing feature 
sizes have fueled rapidly increasing performance. However, the limited improvements in 
DRAM latency and bandwidth and diminishing returns of increasing superscalar ILP and 
cache sizes have led to the proposal of new microprocessor architectures that implement 
processor-in-memory, stream processing, and tiled processing. Each architecture is

David A. Patterson

September 1988 ACM SIGARCH Computer Architecture News, Volume 16 Issue 4

Full text available: pdf Additional Information: full citation , index terms



12 WBIA'05: Using Pin as a memory reference generator for multiprocessor simulation
Collin McCurdy, Charles Fischer 

December 2005 ACM SIGARCH Computer Architecture News, Volume 33 issue 5 
Publisher: ACM Press 

Full text available: pdf(249.21 KB) Additional Information: full citation , abstract , references , index terms

In this paper we describe how we have used Pin to generate a multithreaded reference 
stream for simulation of a multiprocessor on a uniprocessor. We have taken special care 
to model as accurately as possible the effects of cache coherence protocol state, and lock 
and barrier synchronization on the performance of multithreaded applications running on 
multiprocessor hardware. We first describe a simplified version of the algorithm, which
uses semaphores to synchronize instrumented application threa ... 

13 Workload optimization: Implicit and explicit optimizations for stencil computations
Shoaib Kamil, Kaushik Datta, Samuel Williams, Leonid Oliker, John Shalf, Katherine Yelick
October 2006 Proceedings of the 2006 workshop on Memory system performance and
correctness MSPC '06
Publisher: ACM Press 

Full text available: pdf(563.94 KB) Additional Information: full citation , abstract , references , index terms

Stencil-based kernels constitute the core of many scientific applications on block- 
structured grids. Unfortunately, these codes achieve a low fraction of peak performance, 
due primarily to the disparity between processor and main memory speeds. We examine 
several optimizations on both the conventional cache-based memory systems of the 
Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the
Cell processor. The optimizations target cache reuse across stencil sweeps, in ... 

14 Improved spill code generation for software pipelined loops
Javier Zalamea, Josep Llosa, Eduard Ayguade, Mateo Valero

May 2000 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 2000 conference
on Programming language design and implementation PLDI '00, Volume 35
Issue 5

Publisher: ACM Press

Full text available: pdf(688.06 KB) Additional Information: full citation , abstract , references , citings , index terms

Software pipelining is a loop scheduling technique that extracts parallelism out of loops by 
overlapping the execution of several consecutive iterations. Due to the overlapping of 
iterations, schedules impose high register requirements during their execution. A schedule 
is valid if it requires at most the number of registers available in the target architecture. If
not, its register requirements have to be reduced either by decreasing the iteration
overlapping or by spilling registers t ... 

Keywords: instruction-level parallelism, register allocation, software pipelining, spill code 



15 Compilers II: Inter-procedural stacked register allocation for Itanium® like architecture 
Liu Yang, Sun Chan, G. R. Gao, Roy Ju, Guei-Yuan Lueh, Zhaoqing Zhang 
June 2003 Proceedings of the 17th annual international conference on
Supercomputing ICS '03

Publisher: ACM Press 

Full text available: pdf(478.20 KB) Additional Information: full citation , abstract , references , citings , index terms

A hardware managed register stack, Register Stack Engine (RSE), is implemented in
Itanium® architecture to provide a unified and flexible register structure to software. The 
compiler allocates each procedure a register stack frame with its size explicitly specified 
using an alloc instruction. When the total number of registers used by the procedures on 
the call stack exceeds the number of physical registers, RSE performs automatically 
register overflows and fills to ensure that the c ... 

Keywords: hot region, hotspot, inter-procedural stacked register allocation, quota 
assignment, register allocation 

Register allocation for software pipelined loops
B. R. Rau, M. Lee, P. P. Tirumalai, M. S. Schlansker 

July 1992 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 1992 conference
on Programming language design and implementation PLDI '92, Volume 27
Issue 7

Publisher: ACM Press

Full text available: pdf(1.84 MB) Additional Information: full citation , abstract , references , citings , index terms

Software pipelining is an important instruction scheduling technique for efficiently
overlapping successive iterations of loops and executing them in parallel. This paper 
studies the task of register allocation for software pipelined loops, both with and without 
hardware features that are specifically aimed at supporting software pipelines. Register 
allocation for software pipelines presents certain novel problems leading to 
unconventional solutions, especially in the presence of hardware s ... 

17 Accelerator: using data parallelism to program GPUs for general-purpose uses
David Tarditi, Sidd Puri, Jose Oglesby 

October 2006 ACM SIGARCH Computer Architecture News , ACM SIGOPS Operating
Systems Review , ACM SIGPLAN Notices , Proceedings of the 12th
international conference on Architectural support for programming
languages and operating systems ASPLOS-XII, Volume 34, 40, 41 Issue 5, 5, 11

Publisher: ACM Press 

Full text available: pdf(266.52 KB) Additional Information: full citation , abstract , references , index terms 

GPUs are difficult to program for general-purpose uses. Programmers can either learn 
graphics APIs and convert their applications to use graphics pipeline operations or they 
can use stream programming abstractions of GPUs. We describe Accelerator, a system 
that uses data parallelism to program GPUs for general-purpose uses instead. 
Programmers use a conventional imperative programming language and a library that 
provides only high-level data-parallel operations. No aspects of GPUs are exposed to ... 

Keywords: data parallelism, graphics processing units, just-in time compilation 

18 Resource constrained dataflow retiming heuristics for VLIW ASIPs
M. Jacome, G. de Veciana, C. Akturan

March 1999 Proceedings of the seventh international workshop on
Hardware/software codesign CODES '99

Publisher: ACM Press

Full text available: pdf(502.71 KB) Additional Information: full citation , references , citings , index terms

Alex Settle, Daniel A. Connors, Gerolf Hoflehner, Dan Lavery
March 2003 Proceedings of the international symposium on Code generation and 
optimization: feedback-directed and runtime optimization CGO '03

Publisher: IEEE Computer Society 

Full text available: pdf(906.60 KB) Additional Information: full citation , abstract , references , citings , index terms

The Intel® Itanium® architecture contains a number of innovative compiler-controllable
features designed to exploit instruction level parallelism. New code generation and
optimization techniques are critical to the application of these features to improve
processor performance. For instance, the Itanium® architecture provides a compiler-
controllable virtual register stack to reduce the penalty of memory accesses associated
with procedure calls. The Itanium® Register Stack Engine ... 

20 Overlapped loop support in the Cydra 5
James C. Dehnert, Peter Y.-T. Hsu, Joseph P. Bratt 

April 1989 ACM SIGARCH Computer Architecture News , Proceedings of the third
international conference on Architectural support for programming
languages and operating systems ASPLOS-III, Volume 17 Issue 2

Publisher: ACM Press 

Full text available: pdf(1.32 MB) Additional Information: full citation , abstract , references , citings , index terms

The Cydra™ 5 architecture adds unique support for overlapping successive iterations of a
loop to a very long instruction word (VLIW) base. This architecture allows highly parallel
loop execution for a much larger class of loops than can be vectorized, without requiring 
the unrolling of loops usually used by compilers for VLIW machines. This paper discusses 
the Cydra 5 loop scheduling model, the special architectural features which support it, and 
the l ...

Lifetime-sensitive modulo scheduling
Richard A. Huff 

June 1993 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 1993 conference
on Programming language design and implementation PLDI '93, Volume 28
Issue 6

Publisher: ACM Press

Full text available: pdf(1.05 MB) Additional Information: full citation , abstract , references , citings , index terms , review

This paper shows how to software pipeline a loop for minimal register pressure without 
sacrificing the loop's minimum execution time. This novel bidirectional slack-scheduling 
method has been implemented in a FORTRAN compiler and tested on many scientific 
benchmarks. The empirical results— when measured against an absolute lower bound on 
execution time, and against a novel schedule-independent absolute lower bound on 
register pressure— indicate ne ... 

Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a
variety of loops, resulting in high performance code but increased register requirements.
We present a combined approach that schedules the loop operations for minimum register
requirements, given a modulo reservation table. Our method determines optimal register
requirements for machines with finite resources and for general dependence graphs. This
method demonstrates the potential of lifetime-sen ...

As VLIW/EPIC processors are increasingly used in real-time, signal-processing, and
embedded applications, the importance of minimizing code size and reducing power is
growing. This paper describes a new architectural mechanism, called the Modulo Schedule
Buffers, that provides an elegant interface for the execution of modulo scheduled loops.
While the performance is similar to that of kernel-only modulo scheduling, this mechanism
has a number of advantages, including minimal code expansion. Rath ...

Modulo scheduling is a very effective instruction scheduling technique that exploits
Instruction Level Parallelism (ILP) in loop bodies by overlapping the execution of
successive iterations. Unfortunately, modulo scheduling has been shown to cause heavy
code expansion. To avoid the penalties of code expansion, some processors have
dedicated hardware support for modulo scheduled loops. However, this dedicated
hardware support has a cost in chip area, cycle time, processor complexity, and
compiler ...

Software pipelining of a multi-dimensional loop is an important optimization that overlaps
the execution of successive outermost loop iterations to explore instruction-level
parallelism from the entire n-dimensional iteration space. This paper investigates register
allocation for software pipelined multi-dimensional loops. For single loop software
pipelining, the lifetime instances of a loop variant in successive iterations of the loop form
a repetitive pattern. An effective register allocation m ...

The paper proposes a novel software-pipelining algorithm, Register Sensitive Force
Directed Retiming Algorithm (RS-FDRA), suitable for optimizing compilers targeting
embedded VLIW processors. The key difference between RS-FDRA and previous
approaches is that our algorithm can handle code size constraints along with latency and
resource constraints. This capability enables the exploration of pareto "optimal" points
with respect to code size and performance

Software pipelining technique is extensively used to exploit instruction-level parallelism of
loops, but also significantly expands the code size. For embedded systems with very
limited on-chip memory resources, code size becomes one of the most important
optimization concerns. This paper presents the theoretical foundation of code size
reduction for software-pipelined loops based on retiming concept. We propose a general
Code-size REDuction technique (CRED) for various kinds of processors. Our ...

Predicated execution enables the removal of branches by converting segments of
branching code into sequences of conditional operations. An important side effect of this
transformation is that the compiler must unconditionally assign resources to predicated
operations. However, a resource is only put to productive use when the predicate
associated with an operation evaluates to True. To reduce this superfluous commitment of
resources, we propose probabilistic predicate-aware scheduling to assign multipl ...

Utilizing parallelism at the instruction level is an important way to improve performance.
Because the time spent in loop execution dominates total execution time, a large body of
optimizations focuses on decreasing the time to execute each iteration. Software
pipelining is a technique that reforms the loop so that a faster execution rate is realized.
Iterations are executed in overlapped fashion to increase parallelism. Let {ABC}n

Keywords: instruction level parallelism, loop reconstruction, optimization, software
pipelining

Tight data- and timing constraints are imposed by communication and multimedia
applications. The architecture for the embedded processor imply resource constraints.
Instead of random-access registers, relative location storages or rotating register files are
used to exploit the available parallelism of resources by means of reducing the initiation
interval in pipelined schedules. Therefore, the compiler or synthesis tool must deal with
the difficult tasks of scheduling of operations and location ...

Modulo scheduling is a framework within which a wide variety of algorithms and heuristics
may be defined for software pipelining innermost loops. This paper presents a practical
algorithm, iterative modulo scheduling, that is capable of dealing with realistic machine
models. This paper also characterizes the algorithm in terms of the quality of the
generated schedules as well as the computational expense incurred.

Keywords: instruction scheduling, loop scheduling, modulo scheduling, software
pipelining

Traditionally, software pipelining is applied either to the innermost loop of a given loop
nest or from the innermost loop to the outer loops. In a companion paper, we proposed a
scheduling method, called Single-dimension Software Pipelining (SSP), to software pipeline
a multi-dimensional loop nest at an arbitrary loop level. In this paper, we describe our
solution to SSP code generation. In contrast to traditional software pipelining, SSP handles
two distinct repetitive patterns, and thus requires new c ...

Traditionally, software pipelining is applied either to the innermost loop of a given loop
nest or from the innermost loop to outer loops. This paper proposes a three-step
approach, called single-dimension software pipelining (SSP), to software pipeline a loop
nest at an arbitrary loop level that has a rectangular iteration space and contains no
sibling inner loops in it. The first step identifies the most profitable loop level for software
pipelining in terms of initiation rate, data ...

The TMS320C6000 architecture is a leading family of Digital Signal Processors (DSPs). To
achieve peak performance, this VLIW architecture relies heavily on software pipelining.
Traditionally, software pipelining has been restricted to regular (FOR) loops. More
recently, software pipelining has been extended to irregular (WHILE) loops, but only on
architectures that provide special-purpose hardware such as rotating (predicate and
general-purpose) register files, specific instruct ...

The paper presents a method for translation validation of a specific optimization, software
pipelining optimization, used to increase the instruction level parallelism in EPIC type of
architectures. Using a methodology as in [15] to establish simulation relation between
source and target based on computational induction, we describe an algorithm that
automatically produces a set of decidable proof obligations. The paper also describes SPV,
a prototype translation validator that automatica ...

The 64-bit Intel® Itanium™ architecture is designed for high-performance scientific and
enterprise computing, and the Itanium processor is its first silicon implementation.
Features such as extensive arithmetic support, predication, speculation, and explicit
parallelism can be used to provide a sound infrastructure for supercomputing. A large
number of high-performance computer companies are offering Itanium™-based systems,
some capable of peak performance exceeding 50 GFLOPS. In ...

high-speed devices with the ability for custom computing whilst maintaining the flexibility
of a software solution. These features are well suited to image processing algorithms that
are computationally intensive and repetitive in nature. Very deep pipelining and
parallelism, features often required for real time image analysis can be achieved easily
using hardware design. The Hough Transform is a powerful and ...
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Software pipelining is an aggressive scheduling technique that generates efficient code for 
loops and is particularly effective for VLIW architectures. Few software pipelining 
algorithms, however, are able to efficiently schedule loops that contain conditional 
branches. We have developed an algorithm we call All Paths Pipelining (APP) that 
addresses this shortcoming of software pipelining. APP is designed to achieve optimal or
near-optimal performance for any run of iterations while providing ef ... 

Code generation schema for modulo scheduled loops
B. Ramakrishna Rau, Michael S. Schlansker, P. P. Tirumalai 

December 1992 ACM SIGMICRO Newsletter , Proceedings of the 25th annual 
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1 Attacking the semantic gap between application programming languages and
configurable hardware

Greg Snider, Barry Shackleford, Richard J. Carter

February 2001 Proceedings of the 2001 ACM/SIGDA ninth international symposium 
on Field programmable gate arrays FPGA '01 

Publisher: ACM Press 

Full text available: pdf(258.65 KB) Additional Information: full citation, abstract, references, citings, index terms

It is difficult to exploit the massive, fine-grained parallelism of configurable hardware with 
a conventional application programming language such as C, Pascal or Java. The
difficulty arises from the mismatch between the synchronous, concurrent processing
capability of the hardware and the expressiveness of the language, the so-called
"semantic gap." We attack this problem by using a programming model matched to the
hardware's capabilities that can be implemented in any (unmodified) objec ... 



Software Pipelining Irregular Loops On the TMS320C6000 VLIW DSP Architecture
Elana Granston, Eric Stotzer, Joe Zbiciak

August 2001 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN workshop on 
Languages, compilers and tools for embedded systems LCTES '01 , 
Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of 
middleware and distributed systems OM '01, volume 36 issue 8 

Publisher: ACM Press 

Full text available: pdf(88.54 KB) Additional Information: full citation, abstract, references, index terms

The TMS320C6000 architecture is a leading family of Digital Signal Processors (DSPs). To 
achieve peak performance, this VLIW architecture relies heavily on software pipelining. 
Traditionally, software pipelining has been restricted to regular (FOR) loops. More 
recently, software pipelining has been extended to irregular (WHILE) loops, but only on 
architectures that provide special-purpose hardware such as rotating (predicate and 
general-purpose) register files, specific instruct ... 

Keywords: VLIW architectures, WHILE loops, digital signal processors, software 
pipelining 
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The research community has studied if-conversion for many years. However, due to the 
lack of existing hardware, studies were conducted by simulating code generated by 
experimental compilers. This paper presents the first comprehensive study of the use of 
predication to implement if-conversion on production hardware with a near-production 
compiler. To better understand trends in the measurements, we generated binaries at 
three increasing levels of if-conversion aggressiveness. For each level, we ... 

Compilers II: Inter-procedural stacked register allocation for Itanium® like architecture
Liu Yang, Sun Chan, G. R. Gao, Roy Ju, Guei-Yuan Lueh, Zhaoqing Zhang
June 2003 Proceedings of the 17th annual international conference on Supercomputing ICS '03
Publisher: ACM Press

Full text available: pdf(478.20 KB) Additional Information: full citation, abstract, references, citings, index terms

A hardware managed register stack, Register Stack Engine (RSE), is implemented in
Itanium® architecture to provide a unified and flexible register structure to software. The
compiler allocates each procedure a register stack frame with its size explicitly specified
using an alloc instruction. When the total number of registers used by the procedures on 
the call stack exceeds the number of physical registers, RSE performs automatically 
register overflows and fills to ensure that the c ... 

Keywords: hot region, hotspot, inter-procedural stacked register allocation, quota 
assignment, register allocation 



Dependences and register allocation: Prematerialization: reducing register pressure 
for free 

Ivan D. Baev, Richard E. Hank, David H. Gross

September 2006 Proceedings of the 15th international conference on Parallel 
architectures and compilation techniques PACT '06 

Publisher: ACM Press 

Full text available: pdf(159.54 KB) Additional Information: full citation, abstract, references, index terms

Modern compiler transformations that eliminate redundant computations or reorder 
instructions, such as partial redundancy elimination and instruction scheduling, are very 
effective in improving application performance but tend to create longer and potentially 
more complex live ranges. Typically the task of dealing with the increased register 
pressure is left to the register allocator. To avoid introduction of spill code which can 
reduce or completely eliminate the benefit of earlier optimization ... 

Keywords: Itanium, VLIW, register allocation, register pressure, rematerialization 



EPIC compilation: Optimization for the Intel® Itanium® architecture register stack 
Alex Settle, Daniel A. Connors, Gerolf Hoflehner, Dan Lavery 

March 2003 Proceedings of the international symposium on Code generation and 
optimization: feedback-directed and runtime optimization CGO '03 

Publisher: IEEE Computer Society 
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Full text available:



The Intel® Itanium® architecture contains a number of innovative compiler-controllable 
features designed to exploit instruction level parallelism. New code generation and 
optimization techniques are critical to the application of these features to improve 
processor performance. For instance, the Itanium® architecture provides a compiler- 
controllable virtual register stack to reduce the penalty of memory accesses associated 
with procedure calls. The Itanium® Register Stack Engine ... 
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Modulo scheduling is a very effective instruction scheduling technique that exploits
Instruction Level Parallelism (ILP) in loop bodies by overlapping the execution of
successive iterations. Unfortunately, modulo scheduling has been shown to cause heavy
code expansion. To avoid the penalties of code expansion, some processors have
dedicated hardware support for modulo scheduled loops. However, this dedicated
hardware support has a cost in chip area, cycle time, processor complexity, and
compiler ... 

8 Register allocation for software pipelined loops
B. R. Rau, M. Lee, P. P. Tirumalai, M. S. Schlansker

July 1992 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1992 conference
on Programming language design and implementation PLDI '92, Volume 27 
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Publisher: ACM Press 

Full text available: pdf(1.84 MB) Additional Information: full citation, abstract, references, citings, index terms

Software pipelining is an important instruction scheduling technique for efficiently 
overlapping successive iterations of loops and executing them in parallel. This paper 
studies the task of register allocation for software pipelined loops, both with and without 
hardware features that are specifically aimed at supporting software pipelines. Register 
allocation for software pipelines presents certain novel problems leading to
unconventional solutions, especially in the presence of hardware s ... 

9 Modulo scheduling: Modulo schedule buffers
Matthew C. Merten, Wen-mei W. Hwu 

December 2001 Proceedings of the 34th annual ACM/IEEE international symposium 
on Microarchitecture MICRO 34 

Publisher: IEEE Computer Society 

Full text available: pdf, Publisher Site Additional Information: full citation, abstract, references, citings

As VLIW/EPIC processors are increasingly used in real-time, signal-processing, and
embedded applications, the importance of minimizing code size and reducing power is 
growing. This paper describes a new architectural mechanism, called the Modulo Schedule 
Buffers, that provides an elegant interface for the execution of modulo scheduled loops. 
While the performance is similar to that of kernel-only modulo scheduling, this mechanism 
has a number of advantages, including minimal code expansion. Rath ... 

10 Smalltalk-80: the language and its implementation
Adele Goldberg, David Robson 

January 1983 Book 

Publisher: Addison-Wesley Longman Publishing Co., Inc. 

Full text available: pdf(33.56 MB) Additional Information: full citation, abstract, cited by, index terms, review
From the Preface (See Front Matter for full Preface) 

Advances in the design and production of computer hardware have brought many more 
people into direct contact with computers. Similar advances in the design and production 
of computer software are required in order that this increased contact be as rewarding as
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possible. The Smalltalk-80 system is a result of a decade of research into creating 
computer software that is appropriate for producing highly functional and interactive ... 

11 An open-source CVE for programming education: a case study: An open-source CVE
for programming education: a case study

Andrew M. Phelps, Christopher A. Egert, Kevin J. Bierre, David M. Parks 
July 2005 ACM SIGGRAPH 2005 Courses SIGGRAPH '05 
Publisher: ACM Press 

Full text available: pdf(7.92 MB) Additional Information: full citation, references



12 Software pipelining

Vicki H. Allan, Reese B. Jones, Randall M. Lee, Stephen J. Allan

September 1995 ACM Computing Surveys (CSUR), volume 27 issue 3 
Publisher: ACM Press 

Full text available: pdf(4.72 MB) Additional Information: full citation, abstract, references, citings, index terms

Utilizing parallelism at the instruction level is an important way to improve performance. 
Because the time spent in loop execution dominates total execution time, a large body of 
optimizations focuses on decreasing the time to execute each iteration. Software 
pipelining is a technique that reforms the loop so that a faster execution rate is realized. 
Iterations are executed in overlapped fashion to increase parallelism. Let {ABC}n 
Keywords: instruction level parallelism, loop reconstruction, optimization, software 
pipelining 



13 A programming language 
Kenneth E. Iverson 
January 1962 Book 

Publisher: John Wiley & Sons, Inc.

Additional Information: full citation, abstract, references, cited by, index terms
From the Preface 



Applied mathematics is largely concerned with the design and analysis of explicit
procedures for calculating the exact or approximate values of various functions. Such 
explicit procedures are called algorithms or programs. Because an effective notation for 
the description of programs exhibits considerable syntactic structure, it is called a 
programming language. 

Much of applied mathematics, particularly the more recent computer-related areas 
which ... 



14 Accelerator: using data parallelism to program GPUs for general-purpose uses
David Tarditi, Sidd Puri, Jose Oglesby

October 2006 ACM SIGARCH Computer Architecture News, ACM SIGOPS Operating
Systems Review, ACM SIGPLAN Notices, Proceedings of the 12th
international conference on Architectural support for programming
languages and operating systems ASPLOS-XII, Volume 34, 40, 41 Issue 5, 5,
11

Publisher: ACM Press 

Full text available: pdf(266.52 KB) Additional Information: full citation, abstract, references, index terms

GPUs are difficult to program for general-purpose uses. Programmers can either learn 
graphics APIs and convert their applications to use graphics pipeline operations or they
can use stream programming abstractions of GPUs. We describe Accelerator, a system 
that uses data parallelism to program GPUs for general-purpose uses instead. 
Programmers use a conventional imperative programming language and a library that 
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15 Artificial intelligence
Elaine Rich 
January 1983 Book 

Publisher: McGraw-Hill, Inc. 

Additional Information: full citation, abstract, references, cited by, review

The goal of this book is to provide programmers and computer scientists with a readable 
introduction to the problems and techniques of artificial intelligence (A.I.). The book can 
be used either as a text for a course on A.I. or as a self-study guide for computer
professionals who want to learn what A.I. is all about. 

The book was designed as the text for a one-semester, introductory graduate course in
A.I. In such a course, it should be possible to cover all of the material in the boo ... 

16 Register allocation for software pipelined multi-dimensional loops
Hongbo Rong, Alban Douillet, Guang R. Gao 

June 2005 ACM SIGPLAN Notices, Proceedings of the 2005 ACM SIGPLAN conference
on Programming language design and implementation PLDI '05, Volume 40 
Issue 6 
Publisher: ACM Press 

Full text available: pdf(745.17 KB) Additional Information: full citation, abstract, references, index terms

Software pipelining of a multi-dimensional loop is an important optimization that overlaps 
the execution of successive outermost loop iterations to explore instruction-level 
parallelism from the entire n-dimensional iteration space. This paper investigates register 
allocation for software pipelined multi-dimensional loops. For single loop software 
pipelining, the lifetime instances of a loop variant in successive iterations of the loop form 
a repetitive pattern. An effective register allocation m ...

Keywords: register allocation, software pipelining 



17 Single-dimension software pipelining for multidimensional loops
Hongbo Rong, Zhizhong Tang, R. Govindarajan, Alban Douillet, Guang R. Gao
March 2007 ACM Transactions on Architecture and Code Optimization (TACO), volume 4
Issue 1
Publisher: ACM Press 

Full text available: pdf(439.64 KB) Additional Information: full citation, abstract, references, index terms

Traditionally, software pipelining is applied either to the innermost loop of a given loop
nest or from the innermost loop to outer loops. This paper proposes a three-step 
approach, called single-dimension software pipelining (SSP), to software pipeline a loop 
nest at an arbitrary loop level that has a rectangular iteration space and contains no 
sibling inner loops in it. The first step identifies the most profitable loop level for software 
pipelining in terms of initiation rate, data ...

Keywords: Software pipelining, loop transformation, modulo scheduling 
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Real-time procedural shading was once seen as a distant dream. When the first version of 
this course was offered four years ago, real-time shading was possible, but only with one- 
of-a-kind hardware or by combining the effects of tens to hundreds of rendering passes. 
Today, almost every new computer comes with graphics hardware capable of interactively 
executing shaders of thousands to tens of thousands of instructions. This course has been 
redesigned to address today's real-time shading capabiil ... 



Optimal register reassignment for register stack overflow minimization
Yoonseo Choi, Hwansoo Han 

March 2006 ACM Transactions on Architecture and Code Optimization (TACO), volume 3
Issue 1
Publisher: ACM Press 

Full text available: pdf(584.90 KB) Additional Information: full citation, abstract, references, index terms

Architectures with a register stack can implement efficient calling conventions. Using the 
overlapping of callers' and callees' registers, callers are able to pass parameters to callees 
without a memory stack. The most recent instance of a register stack can be found in the 
Intel Itanium architecture. A hardware component called the register stack engine (RSE) 
provides an illusion of an infinite-length register stack using a memory-backed process to
handle overflow and underflow for a physically ... 

Keywords: Register assignment, register allocation, register stack, sequence graph 
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Measurement and evaluation of the MIPS architecture and processor 
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Publisher: ACM Press 

Additional Information: full citation, abstract, references, citings, index terms, review

Full text available: pdf(2.30 MB)



MIPS is a 32-bit processor architecture that has been implemented as an nMOS VLSI chip. 
The instruction set architecture is RISC-based. Close coupling with compilers and efficient 
use of the instruction set by compiled programs were goals of the architecture. The MIPS 
architecture requires that the software implement some constraints in the design that are 
normally considered part of the hardware implementation. This paper presents 
experimental results on the effectiveness of this processor ... 
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